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Abstract —Sparse representation models a signal as a linear 
combination of a small number of dictionary atoms. As a gener¬ 
ative model, it requires the dictionary to be highly redundant 
in order to ensure both a stable high sparsity level and a 
low reconstruction error for the signal. However, in practice, 
this requirement is usually impaired by the lack of labelled 
training samples. Fortunately, previous research has shown that 
the requirement for a redundant dictionary can be less rigorous 
if simultaneous sparse approximation is employed, which can be 
carried out by enforcing various structured sparsity constraints 
on the sparse codes of the neighboring pixels. In addition, 
numerous works have shown that applying a variety of dictionary 
learning methods for the sparse representation model can also im¬ 
prove the classification performance. In this paper, we highlight 
the task-driven dictionary learning algorithm, which is a general 
framework for the supervised dictionary learning method. We 
propose to enforce structured sparsity priors on the task-driven 
dictionary learning method in order to improve the performance 
of the hyperspectral classification. Our approach is able to 
benefit from both the advantages of the simultaneous sparse 
representation and those of the supervised dictionary learning. 
We enforce two different structured sparsity priors, the joint 
and Laplacian sparsity, on the task-driven dictionary learning 
method and provide the details of the corresponding optimization 
algorithms. Experiments on numerous popular hyperspectral 
images demonstrate that the classification performance of our 
approach is superior to sparse representation classifier with 
structured priors or the task-driven dictionary learning method. 

Index Terms —Sparse representation, supervised dictionary 
learning, task-driven dictionary learning, joint sparsity, Lapla¬ 
cian sparsity, hyperspectral imagery classification 


I. Introduction 

C LASSIFICATION on Hyperspectral Imagery (HSI) is 
becoming increasingly popular in remote sensing. No¬ 
table applications include military aerial surveillance Q- 
mineral identification and material defects detection 0. 
However, numerous difficulties impede the improvement of 
HSI classification performance. For instance, the high dimen¬ 
sionality of HSI pixels introduce the problem of the ‘curse 
of dimensionality’ 0. and the classifier is always confronted 
with the overfitting problem due to the small number of 

X.Sun and T. D. Tran are with the Department of Electrical and Computer 
Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA (e- 
mail: xsun9@jhu.edu; trac@jhu.edu). This work has been partially supported 
by NSF under Grants CCF-1117545, ARO under Grants 60219-MA, and ONR 
under grant N000141210765. 

N. M. Nasrabadi is with U.S. Army Research Laboratory, Adelphi, MD 
20783 USA (e-mail: nnasraba@arl.army.mil). 


labelled samples. Additionally, most HSI pixels are indiscrim- 
inative since they are undesirably highly coherent |6|. In the 
past few decades, numerous classification techniques, such as 
SVM [7], k-nearest-neighbor classifier^, multimodel logistic 
regression |9] and neural network fT0|7 have been proposed 
to alleviate some of these problems to achieve an acceptable 
performance for HSI classification. 

A. Sparse Representation for HSI classification 

More recently, researchers have focused attention on de¬ 
scribing the high dimensional data as a sparse linear com¬ 
bination of dictionary atoms. Sparse representation classifier 
(SRC) was proposed in GD and has been successfully applied 
to a wide variety of applications, such as face recognition GD- 
visual tracking GD speech recognition GD and aerial image 
detection GD- SRC has also been applied to HSI classification 
by Chen et. al. GD where a dictionary was constructed by 
stacking all the labelled samples. Success of SRC requires that 
the high dimensional data belonging to the same class to lie 
in a low dimensional subspace. The outstanding classification 
performance is due to the robustness of sparse recovery, which 
is largely provided by the high redundancy and low coherency 
of the dictionary atoms. A low reconstruction error and a 
high sparsity level can be achieved if the designed dictionary 
satisfies the above properties. Unfortunately, in practice, the 
HSI dictionary usually does not have the above properties due 
to the small number of bluehighly correlated labelled training 
samples ©• 

Due to these undesired properties of the HSI dictionary, the 
sparse recovery can become unstable and unpredictable such 
that even pixels belonging to the same class can have totally 
different sparse codes. The problem induced by the high- 
coherency of the dictionary atoms, which can be alleviated 
through decreasing the variation between the sparse codes 
of the hyperspectral pixels that belong to the same class. 
In HSI, pixels that are spatially close to each other usually 
have similar spectral features and belong to the same class. 
Previous research has shown that the sparse codes of neigh¬ 
boring pixels can become similar by enforcing a structured 
sparsity constraint (prior). The simultaneous sparse recovery 
is analytically guaranteed to achieve a sparser solution and 
a lower reconstruction error with a smaller dictionary fT6| . 
A variety of structured sparsity priors are proposed in the 
literature GD that are capable of generating different desired 
sparsity patterns for the sparse codes of neighboring pixels. 
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The joint sparsity prior GD assumes that the features of all the 
neighboring pixels lie in the same low dimensional subspace 
and all the corresponding sparse codes share the same set 
of dictionary atoms. Therefore, the sparse codes have a row 
sparsity pattern, where only a few rows of the sparse codes are 
nonzero (18) , (T9) . The collaborative group sparsity prior (20) 
enforces the coefficients to have a group-wise sparsity pattern, 
where the coefficients within each active group are dense. 
The collaborative hierarchical sparsity prior ED enforces the 
sparse codes to be not only group-wise sparse, but also sparse 
within each active group. The low rank prior (22) assumes 
that the neighboring pixels are linearly dependent. It does 
not necessary lead the coefficients to be sparse, which is 
detrimental for a good classification. However, the low rank 
group prior proposed in ED is able to enforce both a group 
sparsity prior and a low rank prior on the sparse codes by 
forcing the same group of dictionary atoms to be active if 
and only if the corresponding neighboring pixels are linearly 
dependent. The Laplacian sparsity prior (23) uses a Laplacian 
matrix to describe the degree of similarity between the neigh¬ 
boring pixels. The neighboring pixels that have less spectral 
features in common are less encouraged to have a similar 
sparse codes. It has been shown that all the structured sparsity 
priors are capable of obtaining a smoother classification map 
and improving the classification performance ED- 


from different classes. It increases the discriminability of the 
sparse codes by decreasing the coherency of the atoms from 
different classes. The label consistent K-SVD (LC-KSVD) 
ED optimizes the dictionary and classifier’s parameter by 
minimizing the summation of reconstruction and classification 
errors. It combines the dictionary and classifier’s parameter 
into a single parameter space, which makes it possible for the 
optimization procedure to be much simpler than those used in 
classical SRC. However, a desired and accurate solution is not 
guaranteed (32) because the cost function can be minimized 
by decreasing the reconstruction error while the classification 
error is increased. A bilevel optimization formulation would 
be more appropriate (33) , where the update of the dictionary is 
driven by the minimization of the classification error. The task- 
driven dictionary learning (TDDL) (27) exploits this idea with 
theoretical proof and demonstrates a superior performance. 
The supervised translation-invariant sparse coding, which uses 
the same scheme as TDDL, is developed independently by 
(34) . It is a more general framework that can be applied not 
only to classification, but also nonlinear image mapping, digi¬ 
tal art authentication and compressive sensing. More recently, 
the group sparsity prior is enforced on a single measurement 
and the corresponding TDDL optimization algorithm is devel¬ 
oped in (35) in order to improve the performance of region 
tagging. 


B. Dictionary Learning for Sparse Representation 

In the classical SRC, the dictionary is constructed by 
stacking all the training samples. The sparse recovery can be 
computationally burdensome when the training set is large. 
Besides, the dictionary constructed in this manner can neither 
be optimal for reconstruction purposes nor for classification 
of signals. Previous literature have shown that a dictionary 
can be trained to have a better representation of the dataset. 
Unsupervised dictionary learning methods, such as the method 
of optimal direction (MOD) (24) , K-SVD [251 and online dic¬ 
tionary learning (26) , are able to improve the signal restoration 
performance of numerous applications, such as compressive 
sensing, signal denoising and image inpainting. 

However, the unsupervised dictionary learning method is 
not suitable for solving classification problems since a lower 
reconstruction error does not necessarily lead to a better classi¬ 
fication performance. In fact, it is observed that the dictionary 
can have an improved classification result by sacrificing some 
signal reconstruction performance (27) . Therefore, supervised 
dictionary learning methods [28] are proposed to improve 
the classification result. Unlike the unsupervised dictionary 
learning, which only trains the dictionary by pursuing a 
lower signal reconstruction error, the supervised learning is 
able to directly improve the classification performance by 
optimizing both the dictionary and classifier’s parameter si¬ 
multaneously. The discriminative dictionary learning in (29) 
minimizes the classification error of SRC by minimizing the 
reconstruction error contributed by the atoms from the correct 
class and maximizing the error from the remaining classes. 
The incoherent dictionary learning in [30) uses SRC as the 
classifier and tries to eliminate the atoms shared by pixels 


C. Contributions 


In this paper, we propose a novel method that enforces 
the joint or Laplacian sparsity prior on the sparse recovery 
stage of TDDL. The existing dictionary learning methods 
have only been developed for reconstructing or classifying a 
single measurement. Therefore, it is advantageous to incorpo¬ 
rate structured sparsity priors into the supervised dictionary 
learning in order to achieve a better performance. This paper 
makes the following contributions: 

• We propose a new dictionary learning algorithm for 
TDDL with joint or Laplacian sparsity in order to ex¬ 
ploit the spatial-spectral information of HSI neighboring 
pixels. 

• We show experimentally that the proposed dictionary 
learning methods have a significantly better performance 
than SRC even when the dictionary is highly compact. 

• We also describe an optimization algorithm for solving 
the Laplacian sparsity recovery problem. The proposed 
optimization method is much faster than the modified 
feature sign search used in (23) . 


The remainder of the paper is organized as follows. In 
Section [II] a brief review of TDDL is given. In Section [III] 
we propose a modified TDDL algorithm with the joint sparsity 
prior. TDDL with the Laplacian prior and a new algorithm for 
recovering the Laplacian sparse problem are stated in Section 
IV In Section [V] we show that our method is superior to 


other HSI classification methods through experimental results 
on several HSI images. Finally, we provide our conclusion in 
Section lYll 
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II. Task-driven Dictionary Learning 

In TDDL (27), signals are represented by their sparse 
codes, which are then fed into a linear regression or logistic 
regression. Consider a pair of training samples (x, y), where 
x G M m is the HSI pixel, M is the number of spectral bands, 
and y G R K is a binary vector representation of the label of 
the sample x. K is the maximum class index. Pixel x can be 
represented by a sparse coefficient vector a(D, x) G R N with 
respect to some dictionary D G M MxAr consisting of N atoms 
by solving the optimization 

a(D, x ) = argmin||x-Dz||| + A||z||i + ^||z|||, (1) 

z 2 

,where A and e are the regularization parameters. A controls 
the sparsity level of the coefficients a. In our experiments, 
we set e to 0 since it does not affect the convergence of the 
algorithm and always gives satisfactory results. 

To optimize the dictionary, TDDL first defines a convex 
function £(D, W, {x^}^ 1 ) to describe the classification risk 
in terms of the dictionary atoms, sparse coefficients and the 
classifier’s parameter W. The function is then minimized as 
follows 

min£( D > w - {xi}?=i) = min/(D,W,{xi}f =1 )+^||W|||., 

D. W 13. W 2 

( 2 ) 

where p > 0 is a classifier regularization parameter to avoid 
overfitting of the classifier (36). The convex function / is 
defined as 


1 5 

/(D,W,{xjf =1 ) = J(yi,'W,a i (p,x i )), 


(3) 


where S is the total number of training samples and 
£(y^, W, a^(D, Xi)) is the classification error for a training 
pair which is measured by a linear regression, i.e. 

J(Yi, W, a<(D, x,)) = 1 11yi - Wa,||§. 

In the following part of the section, we omit the subscript 
i of a for notational simplicity. The dictionary D and the 
classifier parameter W are updated using a stochastic gradient 
descent algorithm, which has been independently investigated 
by (27), (34). The update rules for D and W are 


(4) 


( D^ t+1 ) = DW - p® ■ dC^/d D, 
j W (t+i) = w w _ p (t) . acw/aw, 


where t is the iteration index and p is the step size. The 
equations for updating the classifier parameter W is straight¬ 
forward since £(y, W, a(D, x)) is both smooth and convex 
with respect to W. We have 


dC 

dW 


= (Wet - y) a T + //W. 


(5) 


The updating equation for the dictionary can be obtained by 
applying error backpropagation, where the chain rule is applied 

dC dC da 
m da <9D 

The difficulty of acquiring a specific form of the above equa¬ 
tion comes from doL /dr>. Since the sparse coefficient a(D,x) 
is an implicit function of D, an analytic form of a with respect 


( 6 ) 


to D is not available. Fortunately, the derivative doc /dr> can still 
be computed by either applying optimality condition of elastic 
net (27) , [37 ]] or using fixed point differentiation (34) , [38]. 

We now focus on computing the derivative using the fixed 
point differentiation. As suggested in (38) , the gradient of Eq. 
0 reaches 0 at the optimal point a 

d|l x — Doj|| = A gH|i 

da ot=6t da 

Expanding Eq. ([7]), we have 



2D t (x — Da) 


= A • sign(a) 

ol=6l 


at=6t 


( 8 ) 


In order to evaluate da /ar >, the derivative of Eq. ^ with 
respect to each element D mn of the dictionary is required. 
Since the differentiation of the sign function is not well 
defined at zero points, we can only compute the derivative 
of Eq. (8j) at fixed points when a[ n ] ^ 0 (34) 


da a 
dD rnn 


(DlD A )- 


<9D^x 

dD mn 


SD^Da \ da A c 

dD mn J dDmn 

(9) 


where A and A c are the indices of the active and inactive 
set of a respectively. D mn G f is the (m, n) element of D. 
(DIDa)- 1 is always invertible since the number of active 
atoms |A| is always much smaller than the feature dimension 
M. 


III. TDDL WITH JOINT SPARSITY PRIOR 

We now extend TDDL by using a joint sparsity (JS) prior 
(TDDL-JS). The joint sparsity prior (18) , (19) enforces the 
sparse coefficients of the test pixel and its neighboring pixels 
within the neighborhood window to have row sparsity pattern, 
where all pixels are represented by the same atoms in the 
dictionary so that only few rows of the sparse coefficients 
matrix are nonzero. The joint sparse recovery can be solved 
by the following Lasso problem 

A = argmin ||X - BZ\\ 2 F + A||Z|| li2 , (10) 

where A, Z G M iVxP are sparse coefficient matrices and 
X = [xi,...,xp] G R MxP represents all the pixels within a 
neighborhood window centered on a test (center) pixel x c . De¬ 
fine the label of the center pixel as y c . P is the total number of 

p 

pixels within the neighborhood window. ||Z||i ?2 = 

i=l 

is the ^i ?2 -norm of Z. Z i G M lxP is the i th row of Z. Many 
sparse recovery techniques are able to solve Eq. (JTOj, such as 
the Alternating Direction Method of Multipliers (39) , Sparse 
Reconstruction by Separable Approximation (SpaRSA) (40) 
and Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) 



Once the sparse code A is obtained, the sparse codes a c 
of the center pixel x c is projected on each of the K decision 
planes of the classifier. The plane with the largest projection 
indicates the class that the center pixel x c belongs to, 

identity (x c ) = argmaxy/c = argmax(Wa c )/ c , (11) 

k k 
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where a c G R N is the sparse coefficients of the center pixel. 
In the training stage, it is expected that the projection of the 
decision plane corresponding to the class of the center pixel 
should be increased while other planes should be orthogonal 
to ol c . Therefore, given the training data (X,y c ), the classifi¬ 
cation error for the center pixel x c is defined as 

£(y c ,W,a c (D,X)) = ||y c — Wa c ||l + |||W|||., (12) 

In order to update the dictionary D, we need to apply a chain 
rule similar to the one in Eq. 

dC dC d A 

(13) 


3D 3 A 3D 

dA 


Now we focus on the difficult part ^ of Eq 
the fixed point differentiation on Eq. ( [TO] ), we 


e ha 1 


IX — DA 


dA 


= -A 


0||A| 


1,2 


A=A 


(14) 


0, the derivative is not well defined, so we set 




dA i 




L iVA 


D t (x - Da) = A 

v ) LII A ill2 II a jva 

Computing the derivative of Eq. © with respect to D 


(15) 


and transposing both sides 

0 |(X — DA) t D j 


0D„ 


wl,er li'- = ire 

Eq. |16j), we have 
(dX T t) 

vec —- A 

y dDmn 


where r = T | 


= A 


A, 1 A; 


ri 


dAj 

dDn 


T 


■ N A 


d ^-NA 

3D 


Algorithm 1 Stochastic gradient descent algorithm for task- 
driven dictionary learning with joint sparsity prior 

Require: Initial dictionary D and classifier W. Parameter A, 
p and to. 

for t = 1 to T do 

Draw one sample (X, y c ) from training set. 


5: 


Find sparse sparse code A according to Eq. ([TO]) 
Find the active set A and define N A = |A| 

A {i : || ||2 7^ 0, i G {1,..., N }}, 

where A^ is the i th row of A. 

Compute r G ’R N ^ PxN ^ p 


. Employing 
ave 


r = iv 


r ? = 


9 Tat a , 

a,A7 


A=A 3A 

In the following part of this section, we omit the fixed point 
notation. Eq. (14) is only differentiable when || A^ ||2 7 ^ 0, 
where A^ denotes the i th row of A. At points where || A; 


llAiiii 


, i 1 5 


,JVa, 


6: 


12 = 

= 0 . 

Denote A = A a G R JNaXP , where A is the active set such 
that A = {i : || A ^|| 2 + 0,i G {1,... ,jV}}, N A = |A|, A a is 
composed of active rows of A, and D is the active atoms of 
D. Expanding the derivative of Eq. (|T4]) on both sides on the 
feasible points, 

AT nT 


7: 


llAiHa 

where ® is the direct sum of matrices. 

Compute 7 G R NaP 

7 = (D t D 01 P + Ar) _T uec((WA - Y) T W). 

where vec(-) and W denote the vectorization operator 
and A columns of W respectively. 

Let (3 G R NxP . Set f3 A c = 0 and construct (3 A G 
RNaxp fast satisfies 


vec 


(A) 


= 7- 


(16) 

i = 1,..., N a . By vectorizing 


8: Choose the learning rate p t min(p, Py). 

9: Update the parameters by gradient projection step 

W ^ W -p t ((Woe - y)aJ + /iW), 

D <- D - pt(- D/3A t + (X - DA)/3 t ), 

and normalize every column of D^ +1 ^ with respect to 
^ 2 -norm. 

10: end for 

it: return D and W. 


<9D t D <9 A t 


dD„ 


dD n 


d t d 


= A • Tvec 


3A T 


TDDL WITH LAPLACIAN SPARSITY PRIOR 


3D mn The joint sparsity prior is a relatively stringent constraint on 


■ Yy- 


(17) 


From Eq. (17), we reach the 


vectorization form of the derivative of A with respect to D n 
given as 


vec \ 


3A T 

dDmn 


( 61 


D 0 Ip + XT 


-1 


vec \ 


3DD 

dDmn 


(18) 

Now we can update the dictionary element-wise using Eq. 
( p~ 8 ) . In order to reach a more concise form for updating 
the dictionary, we perform algebraic transformations on Eq. 
( p~3) and Eq. ( fl~ 8 ] ), which are illustrated in Appendix |VII| We 
illustrate the overall optimization for TDDL-JS in Algorithm 
[I] It should be noted that in the Algorithm [l] we define A = 
[0,... ,a c ,... ,0] G R NxP and Y = [0,..., y,..., 0] G 

jrpKxP 


the sparse codes since it assumes that all the neighboring pixels 
have the same support as the center pixel. The assumption 
of the joint sparsity prior can easily be violated on non- 
homogeneous regions, such as a region that contains pixels 
from different classes. This makes choosing a proper neighbor- 
<9Koo6> ^indow size a difficult problem. When the window size 
dD^QQ warge, the sparse codes of the non-homogeneous regions 
within the window are indiscriminative. On the other hand, the 
sparse codes are not stable if the window size is chosen to be 
too small. Ideally, we hope that the performance is insensitive 
to both the choice of the window size and the topology of the 
image. To achieve this requirement, we propose to enforce the 
Laplacian sparsity (LP) prior (TDDL-LP) on the TDDL, where 
the degree of similarity between neighboring pixels can be 
utilized to push the sparse codes of the neighboring pixels that 
belong to the same class to be similar, instead of enforcing all 
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the neighboring pixels to have a similar sparse codes blindly. 
The corresponding Lasso problem can be stated as follows 

p 

A = argmin ||X - BZ\\ 2 2 + A||Z||i + 7 ^c jj ||Z i - 2?\\l 

i,3 

(19) 

where Z 2 and Z J denote the i th and j th columns of Z. Cij is 
a weight whose value is proportional to the spectral similarity 
of X 2 and X J , which are the i th and j th columns of X. 7 is a 
regularization parameter. 

The Laplacian sparse recovery described by Eq. © in 
(23) is able to discriminate pixels from different classes by 
defining an appropriate weighting matrix C = [c^] G M PxP . 
Additionally, it enforces both the support and the magnitude 
of sparse coefficients of similar spectral pixels to be similar, 
whereas the joint sparsity prior enforces sparse coefficients of 
all the pixels within the neighborhood window to have the 
same support. Eq. © can be reformulated as 

A = argmm ||X - DZ||| + A||Z||i + 7 tr(ZLZ T ), (20) 

where L = B — C G M PxP is the Laplacian matrix (42) . 
B = [bij] G M PxP is a diagonal matrix such that bn = c ij- 

3 

In this paper, we adopt the method of Sparse Reconstruction 
by Separable Approximation (SpaRSA) (TT) , (40) to solve the 
Laplacian sparse coding problem. 


A. Sparse Recovery Algorithm 

A modified feature sign search (23) is capable of solving 
the optimization problem ( [20] ). It uses coordinate descent to 
update each column of A iteratively. Although it gives plausi¬ 
ble performance for the SRC-based HSI classification © 
it demands a high computational cost. The SpaRSA-based 
method can achieve a similar optimal solution of Eq. ( [20] ) 
while being less computational burdensome. Despite the fact 
that our previous work © has shown that the performance of 
the SRC-based approach for HSI classification can be largely 
influenced by the choice of specific optimization technique, we 
found that such influence is reasonably small when employing 
the dictionary-learning-based approach. Therefore, we use 
a SpaRSA-based method to solve the sparse recovery for 
the Laplacian sparsity prior. Although, SpaRSA is originally 
designed to solve the optimization of single-signal case, it can 
be easily extended to tackle the problem with multiple signals, 
such as the collaborative hierarchical Lasso (C-Hilasso) © 

SpaRSA is able to solve optimization problems that have 
the following form 


min / (A) 

AeR^x-P j 


•AV-(A), 


( 21 ) 


where / : M 7VxP g M is a convex and smooth function, 
ip : R NxP M is a separable regularizer and A is the 
regularization parameter. In the particular case of the Laplacian 
sparse recovery, the regularizer ip is chosen to be the i\— norm, 
i.e. ip( A) = ||A||i, and the convex function / is set as 

/ (A) = ||X — DA||p + ■ytr (ALA 1 "). ( 22 ) 


Algorithm 2 Sparse recovery for Laplacian sparsity prior 
using SpaRSA 

Require: Dictionary D, constants po >0, 0 < rj m i n < r/ max , 

p > 1 

l: Set t = 0 and A^ 0 ^ = 0 

2: repeat 

3: choose T]( ^ G [Pmim ?7max] 

4: compute UW <- AW - ^V/(AW). 

5: repeat 

6: A« <- S^_ (UW) , 

7: rjW <— pp^\ 

8: until stopping criterion is satisfied 

9: t i — t T" 1. 

10: until stopping criterion is satisfied 

ll: return The optimal sparse coefficients A*. 


In order to search the optimal solution of Eq. ( [21) , SpaRSA 
generates a sequence of iterations A®, t = 1 , 2 ,..., by 
solving the following subproblem 


A< t+1 ) Earg min p (z - A^) V/(A^) + ^- ||Z-A^ (Z), 

(23) 

where rj^ > 0 is a nonnegative scalar such that rj^ = 
and p > 1. The Eq. (23) can be simplified into the 
following form by eliminating the terms independent of Z 


1 


mm 


^|Z-U«||| + 7 U(Z) 


« 


(24) 


zem JV xP 2 "* py 

where = A®-4jV/(A^). The optimization problem 

in Eq. ( [24] ) is separable element-wise, which can be reformu¬ 
lated into 

min I(%-w-1) 2 +-A^( z ), Vi s 1,..., N and j = 1,..., P. 

A-ij a py ' 

(25) 

The problem in Eq. ( [25] ) has a unique solution and can be 
solved by the well-known soft thresholding operator S(-) 


z ij = S ^J ( 4 ?) = s ign(Uij) max{ 0 , \uij\ - A}. (26) 

Comparing with the algorithm proposed in (23), which is 
based on the coordinate descent, Laplacian sparse recovery 
using SpaRSA is more computationally efficient since it is 
able to cheaply search for a better descent direction V/(A). 
The corresponding optimization is stated in Algorithm [2] 


B. Dictionary Update 


In order to adjust the dictionary, we now follow Eq. © 
to derive using the fixed point differentiation. Applying 
differentiation on Eq. ([19]) on the fixed point A 


d||X-DA ||!+ 7 tr (ALA t ) 


dA 


= -X- 


A=A 


dA 


\A= 


hi) 


In the following part, we omit the fixed point notation. By 
computing the derivation and then applying the vectorization 
on Eq. ( [27) , we have 


vec (D t (X - DA) - 7 AL) = A • vec (sign (A)). (28) 


( 22 ) 
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Algorithm 3 Stochastic gradient descent algorithm for task- 
driven dictionary learning with Laplacian sparsity prior 


Require: Initial dictionary D and classifier W. Parameter A, 
p and to. 

for t = 1 to T do 

Draw one sample (X, y c ) from training set. 


Find sparse code A according to Eq. (IQ). 
Find the active set A 


A <- {i : vec(A)i ^ 0, i G {1,..., NP }}, 

where vec(A)i is the i th element of vec(A). 

5: Let (3 G R NxP . Set vee(/3) A c = 0 and compute 

vec(f3) A 

vec(f3) A = (I P (g)D T D+7L(8)I A r)X>ec(W T (WA-Y)) 

and (g) denotes the Kronecker product. 

6: Choose the learning rate p t <— min(p, Py). 

7: Update the parameters by gradient projection step 

W ^ W - Pt ((Wa c - y)aj + fiW), 

D •<- D — p t (— D/3A t + (X - DA)/3 t ), 


(initial step size), N (dictionary size) and P (number of 
neighboring pixels), would introduce significant computational 
cost. Instead, we search for the optimal values for the above 
parameters according to the following procedure. 

• The candidate dictionary sizes are from 5 to 10 atoms 
per class. The choice of dictionary size depends on the 
classification performance and computational cost. In our 
experiment, we set the dictionary size to be 5 atoms per 
class. 

• Searching for the optimal window size and the regulariza¬ 
tion parameters would be cumbersome. Empirically, we 
found that the optimal regularization parameters are less 
likely to be affected by the choice of the window size. 
Therefore, for each image, we fix the window size to be 

5 3 x 3 in order to save computational resource during the 

search of the optimal regularization parameters. Candi¬ 
date regularization parameters are {l0 -3 ,10 -2 ,10 -1 }. 

• The possible candidate window sizes are 3 x 3, 5x5, 
7x7 and 9x9. We search for the optimal window size 
for each image after finding the optimal regularization 
parameters. 


and normalize every column of with respect to 

£ 2 -norm. 

8 : end for 

9: return D and W. 


The differentiation dvec ^n{A)) not we q defined on zero 
points of vec (sign (A)). Similar as in TDDL-JS, we set the i th 


element- ^ 1 ])i = 0 when vec (sign (A)) • = 0. Denote 

the A as the index set of nonzero elements of vec (sign (A)). 
Compute the derivative of Eq. ([28]) with respect to D mn 


d {vec (D t (X - DA) - 7 AL) a } 


dD n 


which leads to 

SD t D 


vec 


dD r . 


A- 


<9D t X 

c)D 


D D 


<9A 

0D rn , 


= 0, 


_dA_ 

'dD rnr 


(29) 


= 0. 

60) 


Now we reach the desired gradient 
<9A 


vec 


(ip' 


dD„ 


D t D 


■ 7 L ( 





By applying algebraic simplification to Eq. ( [3l] ), which is 
shown in Appendix |VII| we reach the optimzation for TDDL- 
LP as stated in the Algorithm [ 3 ] It should be noted that A 
and Y have the same definitions as those in Algorithm [l] 


V. EXPERIMENTS 


A. Experiment Setup 

Cross-validation to obtain the optimal values for all pa¬ 
rameters, including A, e, 7 (sparse coding regularization pa¬ 
rameters), p (regularization parameter for the classifier), po 


TABLE I: Parameters Used in the Paper 


Structured Priors 

A 

7 

p 

£1 

icr 2 


10“ 2 

JS 

(M 

1 

O 

1—1 


10“ 3 

LP 

IQ” 2 

10- a 

IQ- 1 


Computing the gradient for a single training sample at each 
iteration of Algorithm [T] or [3] will make the algorithm converge 
very slowly. Therefore, following the previous work (25), (27), 
we implement the two proposed algorithms with the mini¬ 
batch method, where the gradients of multiple training samples 
are computed in each iteration. For the unsupervised learning 
methods, the batch size is set to 200. For the supervised learn¬ 
ing methods, the batch size is set to 100 and to = T/ 10. We 
search the optimal regularization parameters for each image 
and found that their optimal values are coincidentally the same. 
The reason could be due to our choice of a large interval 
for the search grid. The regularization parameters used in our 
paper are shown in Table |l| We set p = 10 -4 . As a standard 
procedure, we evaluate the classification performance on HSI 
image using the overall accuracy (OA), average accuracy (AA) 
and kappa coefficient (k). The classification methods that are 
tested and compared are SVM, SRC, SRC with joint sparsity 
prior (SRC-JS), SRC with Laplacian sparsity prior (SRC- 
LP), unsupervised dictionary learning (ODL), unsupervised 
dictionary learning with joint or Laplacian sparsity prior 
(ODL-JS, ODL-LP), TDDL, TDDL-JS and TDDL-LP. During 
the testing stage, all training pixels are excluded from the HSI 
image, which means there may be some ‘holes’ (training pixels 
deleted) inside a neighborhood window. This is reasonable 
since we do not want the classification results to be affected 
by the spatial distribution of the labelled samples. We use 
SPAMS toolbox (43) to perform the joint sparse recovery 
via the Fast Iterative Shrinkage-Thresholding Algorithm [41]. 
The sparse recovery for SRC-based methods are performed 
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via the Alternating Direction Method of Multipliers (39) . The 
modified SpaRSA shown in Algorithm [2] is used to solve the 
Laplacian sparse recovery problem. 

For the unsupervised dictionary learning methods, the dic¬ 
tionary is initialized by randomly choosing a subset of the 
training pixels from each class and updated using the online 
dictionary learning (ODL) procedure in (26) . The classifier’s 
parameter are then obtained by using a multi-class linear 
regression. For the supervised dictionary learning methods, 
the dictionary and classifier’s parameter are initialized by the 
training results of ODL for the unsupervised method. 



Fig. 3: The effect of different window sizes for the Indian Pine 
image. The dictionary size is fixed at five atoms per class. 



(a) (b) 


Fig. 1: (a) Training sets and (b) test sets of the Indian Pine 
image. 


TABLE II: Number of training and test samples for the Indian 
Pine image 


Class # 

Name 

Train 

Test 

1 

Alfalfa 

6 

48 

2 

Corn-notill 

137 

1297 

3 

Corn-min 

80 

754 

4 

Corn 

23 

211 

5 

Grass/Pasture 

48 

449 

6 

Grass/Trees 

72 

675 

7 

Grass/Pasture-mowed 

3 

23 

8 

Hay-windrowed 

47 

442 

9 

Oats 

2 

18 

10 

Soybeans-notill 

93 

875 

11 

Soybeans-min 

235 

2233 

12 

Soybean-clean 

59 

555 

13 

Wheat 

21 

191 

14 

Woods 

124 

1170 

15 

Building-Grass-Trees-Drives 

37 

343 

16 

Stone-steel Towers 

10 

85 

Total 

997 

9369 



Fig. 2: The result with different dictionary sizes for the Indian 
Pine image. 


B. Classification on AVIRIS Indian Pine Dataset 


We first perform HSI classification on the Indian Pine im¬ 
age, which is generated by Airborne Visible/Infrared Imaging 
Spectrometer (AVIRIS). Every pixel of the Indian Pine consists 
of 220 bands ranging from 0.2 to 2.4/rni, of which 20 water 
absorption bands are removed before classification. The spatial 
dimension of this image is 145 x 145. The image contains 
16 ground-truth classes, most of which are crops, as shown 
in Table [n] We randomly choose 997 pixels (10.64% of all 
the interested pixels) as the training set and the rest of the 
interested pixels for testing. 

The total iterations of unsupervised and supervised dictio¬ 
nary learning methods are set to 15 and 200 respectively for 
this image. The classification results with varying dictionary 
size N are shown in Fig. [2] In most cases, the classification 
performance increases with the increment in the dictionary 
size. All methods attain their highest OA when the dictionary 
size is 10 atoms per class. The OA of ODL-JS, ODL-LP, 
TDDL-JS and TDDL-LP do not change much when the dic¬ 
tionary size increase from 5 to 10 atoms per class. Therefore, 
it is reasonable to set the dictionary size to be 5 atoms per 
class by taking computational cost into account. Fig. [2] also 
suggests that a plausible performance can be obtained even 
when the dictionary is very small and not over-complete. The 
classification performance with respect to the window size is 
demonstrated in Fig. [3] Using a window size of 5 x 5, ODL-JS 
and TDDL-JS achieves the highest OA of 88.36% and 92.65%, 
respectively. When the window size is set to 7 x 7, the ODL- 
LP and TDDL-LP reach their highest OA = 91.39% and OA 
= 94.20%, respectively. ODL-JS and TDDL-JS reach better 
performance when the window size is not larger than 5x5. The 
TDDL-LP outperforms all other methods when the window 
size is 7 x 7 or larger. Since a larger window size has more 
chances to include non-homogeneous regions, it verifies our 
argument that the Laplacian sparsity prior works better for 
classifying pixels lying in the non-homogeneous regions. 

Detailed classification results of various methods are shown 
in Table [Til] and visually displayed in Fig. [4] The OA of 
ODL-LP reaches 91.39%, which is more than 20% higher 
than that of ODL and 3% higher than that of ODL-JS. The 
TDDL-LP has the highest classification accuracy for most 
classes. Most methods have 0% accuracy for class 9 since 
there are too few training samples in this class. The overall 
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(a) SVM, OA = 64.94% 


(b) SRC, OA= 71.17% 


(d) SRC-LP, OA = 79.40% 


(e) ODL, OA = 71.04% 


(f) ODL-JS, OA = 88.36% (g) ODL-LP, OA = 91.39% (h) TDDL, OA = 81.43% (i) TDDL-JS, OA = 92.65% (j) TDDL-LP, OA = 94.20% 

Fig. 4: Classification map of the Indian Pine image obtained by (a) SVM, (b) SRC, (c) SRC-JS, (d) SRC-LP, (e) ODL, (f) 
ODL-JS, (g) ODL-LP, (h) TDDL, (i) TDDL-JS and (j) TDDL-LP 


TABLE III: Classification accuracy (%) for the Indian Pine image 


Dictionary Size 

N = 997 

N = 80 

Class 

SVM 

SRC 

SRC-JS 

SRC-LP 

ODL 

ODL-JS 

ODL-LP 

TDDL 

TDDL-JS 

TDDL-LP 

1 

77.08 

68.75 

79.17 

82.42 

75.00 

97.92 

70.83 

50.00 

35.42 

56.25 

2 

84.96 

58.84 

81.94 

81.34 

59.69 

91.24 

94.26 

84.03 

94.57 

93.95 

3 

62.67 

24.40 

56.67 

47.35 

62.93 

81.20 

84.40 

69.73 

84.13 

92.13 

4 

8.57 

49.52 

27.62 

49.76 

23.81 

47.62 

61.90 

14.76 

79.05 

46.19 

5 

77.18 

81.88 

85.46 

83.96 

82.55 

93.29 

92.62 

89.04 

90.16 

90.83 

6 

91.82 

96.88 

98.36 

97.48 

88.24 

99.55 

98.96 

98.66 

99.55 

98.96 

7 

13.04 

0.00 

0.00 

0.00 

4.35 

17.39 

0.00 

0.00 

0.00 

95.65 

8 

96.59 

96.59 

100.00 

99.55 

96.36 

99.32 

99.32 

99.09 

100.00 

100.00 

9 

0.00 

5.56 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

10 

71.30 

24.00 

18.94 

31.89 

67.51 

77.73 

91.04 

72.90 

90.13 

94.03 

11 

35.25 

96.22 

91.63 

94.58 

67.94 

88.25 

94.10 

85.46 

96.22 

97.37 

12 

42.39 

32.97 

45.29 

64.68 

80.62 

88.59 

83.15 

59.06 

86.78 

95.47 

13 

91.05 

98.95 

99.47 

99.48 

95.79 

100.00 

100.00 

100.00 

100.00 

100.00 

14 

94.85 

98.97 

98.97 

99.49 

87.20 

97.77 

99.14 

98.11 

99.40 

99.40 

15 

30.70 

49.71 

55.85 

63.84 

32.16 

70.76 

67.84 

47.66 

77.78 

82.75 

16 

27.06 

88.24 

95.29 

97.65 

69.41 

96.47 

85.88 

92.94 

91.76 

98.82 

OA[%] 

64.94 

71.17 

76.41 

79.40 

71.04 

88.36 

91.39 

81.43 

92.65 

94.20 

AA[%] 

56.53 

60.72 

64.67 

64.67 

62.10 

77.94 

82.18 

66.43 

76.56 

83.86 

K 

0.647 

0.695 

0.737 

0.712 

0.691 

0.851 

0.907 

0.8087 

0.924 

0.940 


performance of TDDL-JS and TDDL-LP have at least 13% 
improvement over the other conventional dictionary learning 
techniques. TDDL-LP significantly outperforms other methods 
on the classes that occupy small regions in the image. The class 
7 (Grass/Pasture-mowed), lying in a non-homogeneous region, 
has only 3 training samples and 23 test samples. The TDDL- 
LP is capable of correctly classify 95.65% test samples while 
the second highest accuracy is only 17.39%. We notice that 
the AA of both ODL-LP (82.18%) and TDDL-LP (83.86%) 
are at least 4% higher than that of the other methods. This 
also suggests that the Laplacian-sparsity-enforced dictionary 
learning methods work better on non-homogeneous regions, 
since the AA can only attain high value when both the most 
regions reach high accuracy. 

C. Classification on ROSIS Pavia Urban Data Set 

The last two images to be tested are the University of Pavia 
and the Center of Pavia, which are urban images acquired by 
the Reflective Optics System Imaging Spectrometer (ROSIS). 
It generates 115 spectral bands ranging from 0.43 to 0.86/im. 

The University of Pavia image contains 610 x 340 pixels. 



Fig. 5: (a) Training sets and (b) test sets of the University of 
Pavia image. 


12 noisiest bands out of all 115 bands are removed. There are 


nine ground-truth classes of interests as shown in Table IV 


For this image, the training samples were manually labelled 
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(f) ODL-JS, OA = 75.83% (g) ODL-LP, OA = 78.15% 


(h) TDDL, OA = 69.30% 


(i) TDDL-JS, OA = 84.48% 


(j) TDDL-LP, OA = 85.70% 


Fig. 6: Classification map of the University of Pavia image obtained by (a) SVM, (b) SRC, (c) SRC-JS, (d) SRC-LP, (e) 
ODL, (f) ODL-JS, (g) ODL-LP, (h) TDDL, (i) TDDL-JS and (j) TDDL-LP 


TABLE V: Classification accuracy (%) for the University of Pavia image 


Dictionary Size 

N = 3921 

N = 45 

Class 

SVM 

SRC 

SRC-JS 

SRC-LP 

ODL 

ODL-JS 

ODL-LP 

TDDL 

TDDL-JS 

TDDL-LP 

1 

84.55 

57.11 

77.04 

95.08 

39.16 

86.64 

79.38 

74.60 

79.27 

87.77 

2 

82.45 

58.22 

67.98 

66.70 

66.37 

56.48 

75.89 

51.27 

86.85 

78.89 

3 

77.08 

57.33 

44.32 

77.55 

65.40 

80.72 

62.42 

77.19 

71.13 

78.79 

4 

94.19 

95.94 

95.13 

95.19 

78.67 

99.04 

96.91 

98.08 

98.87 

98.21 

5 

99.01 

100.00 

99.85 

100.00 

99.91 

100.00 

99.82 

99.91 

99.91 

99.91 

6 

23.55 

89.60 

88.31 

96.60 

64.94 

96.89 

72.13 

90.07 

68.74 

91.64 

7 

2.06 

83.27 

96.59 

96.59 

91.64 

91.23 

84.10 

86.14 

68.09 

93.17 

8 

33.89 

48.65 

65.20 

67.36 

67.36 

90.81 

75.98 

78.00 

95.54 

94.20 

9 

53.05 

93.69 

99.59 

99.59 

71.07 

98.37 

93.46 

95.72 

91.82 

95.09 

OA[%] 

69.84 

66.51 

74.05 

80.82 

64.57 

75.83 

78.15 

69.30 

84.48 

85.70 

AA[%] 

61.09 

75.98 

80.06 

88.80 

71.66 

88.91 

82.23 

83.44 

84.47 

90.85 

K 

0.569 

0.628 

0.681 

0.758 

0.549 

0.731 

0.747 

0.662 

0.817 

0.835 


by an analyst. The total number of training and testing 
samples is 3,921 (10.64% of all the interested pixels) and 
40,002 respectively. The training and testing map are visually 
displayed in Fig. [5] 

For the University of Pavia, we set the total iterations of 
unsupervised and supervised dictionary learning methods to 
be 30 and 200 respectively. The window size is set to 5 x 5 
for all joint or Laplacian sparse regularized methods to obtain 
the highest OA. The ODL-LP is able to reach a performance 
of 78.15% for OA, which is more than 14% higher than that of 
ODL. The ODL-JS also significantly improves the OA, which 


is more than 11% higher than that of ODL. TDDL-LP has 
the highest OA = 85.70%, which indicates that it outperforms 
other methods when classify large regions of the image. It also 
has the highest hi = 0.935. The best classification accuracy 
for class 1 (Asphalt), which consists of narrow strips, is 
obtained by using TDDL-LP (87.77%). Class 2 (Meadows) 
is composed of large smooth regions, as expected, TDDL- 
JS gives the highest accuracy (86.85%) for this class. TDDL 
has large amount of misclassification pixels for class 2. The 
highest A A (90.85%) is given by TDDL-LP, which confirms 
that the TDDL-LP is superior to other methods when classify 












IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING 


10 


TABLE IV: Number of training and test samples for the 
University of Pavia image 


Class # 

Name 

Train 

Test 

1 

Asphalt 

548 

6304 

2 

Meadows 

540 

18146 

3 

Gravel 

392 

1815 

4 

Trees 

524 

2912 

5 

Metal sheets 

265 

1113 

6 

Bare soil 

532 

4572 

7 

Bitumen 

375 

981 

8 

Bricks 

514 

3364 

9 

Shadows 

231 

795 

Total 

3921 

40002 



Fig. 7: (a) Training sets and (b) test sets of the Center of Pavia 
image. 


TABLE VI: Number of training and test samples for the Center 
of Pavia image 


Class # 

Name 

Train 

Test 

1 

Water 

745 

64533 

2 

Trees 

785 

5722 

3 

Meadows 

797 

2094 

4 

Bricks 

485 

1667 

5 

Soil 

820 

5729 

6 

Asphalt 

678 

6847 

7 

Bitumen 

808 

6479 

8 

Tile 

223 

2899 

9 

Shadows 

195 

1970 

Total 

5536 

97940 


the pixels in non-homogeneous regions. 

The third image where we evaluate various approaches is 
the Center of Pavia, which consists of 1094 x 492 pixels. 
Each pixel has 102 bands after removing 13 noisy bands. 
This image consists of nine ground-truth classes of interest 
as shown in Table [VI] and Fig. [7] 5, 536 manually labelled 
pixels are designated as the training samples and the remaining 
97,940 interested pixels are used for testing. 

Since this image has more labeled samples than the other 
two images, we set the total iterations of unsupervised and 
supervised dictionary learning methods to be 75 and 1000 
respectively. The window size is set to 5 x 5 for the joint 
sparse and Laplacian sparse regularized methods. Although 


the OA of most methods are close, the OA of ODL-JS and 
ODL-LP are still around 3% higher than that of ODL. The 
TDDL-LP reach the highest OA = 98.67% over all the other 
methods. The OA of TDDL-JS (98.01%) is slightly lower 
than that of the TDDL-LP. We notice that SRC-JS (OA = 
98.01%) and SRC-LP (OA=98.36%) also render competitive 
performance when compared to TDDL-JS and TDDL-LP due 
to the fact that the raw spectral features of this image is already 
highly discriminative. TDDL-LP outperforms other methods 
on almost all classes and works especially well for Class 
4 (Bricks), achieving highest accuracy of 97.41%. Except 
for SRC-LP where the accuracy is 94.72%, none of others 
reaches accuracy over 90% for Class 4. Additionally, the AA 
of TDDL-LP (97.21%) is almost 2% better than that of TDDL- 
JS (95.68%). These results support our assertion that the 
Laplacian sparsity prior provides stronger discriminability on 
nonhomogeneous regions. Performance comparison between 
the SRC-based and TDDL-based methods have shown that 
the dictionary size can be drastically decreased by applying 
supervised dictionary learning while achieving even better 
performance. 


VI. CONCLUSION 

In this paper, we proposed novel a task driven dictionary 
learning method with joint or Laplacian sparsity prior for HSI 
classification. The corresponding optimization algorithms are 
developed using fixed point differentiation, and are further 
simplified for ease of implementation. We also derived the 
optimization algorithm for solving the Laplacian sparse recov¬ 
ery problem using SpaRS A, which improves the computational 
efficiency due to the availability of a more accurate descent 
direction. The performance and the behavior of the proposed 
methods, i.e. TDDL-JS and TDDL-LP, have been extensively 
studied on the popular hyperspectral images. The results 
confirm that both TDDL-JS and TDDL-LP give plausible 
results on smooth homogeneous regions, while TDDL-LP one 
works better for classifying small narrow regions. Compared 
to TDDL-JS, TDDL-LP is able to obtain a more stable 
performance by describing the similarities of neighboring 
pixels’ sparse codes more delicately. The results also confirm 
that a significantly better performance can still be achieved 
when joint or Laplacian prior is imposed by using a very 
small dictionary. The overall accuracy of our algorithm can be 
improved by applying kernelization to the proposed approach. 
This can be achieved by kernelizing the sparse representation 
]44| and using a composite kernel classifier |45| . 


VII. Appendix A 

We can infer from Eq. ( j~j~8] ) that vec ( dA/dD mn ) = 0, Vn G 
A c , which indicates dC/dD mn = 0, Vn G A c . Therefore, 
we only need to take the gradient dC/dD mn , Vn G A into 
account. 

From the Eq. ( [13] ) and Eq. ( [18] ), we achieve the gradient for 
every element of D, 


dC (dC 

—~- = vec —— 

dDmn \dA 


T 


• vec 


dA 

r)f) 

UJ - y mn 


(32) 
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(f) ODL-JS, OA = 96.13% (g) ODL-LP, OA = 97.86% (h) TDDL, OA = 96.30% (i) TDDL-JS, OA = 98.01% (j) TDDL-LP, OA = 98.67% 


Fig. 8: Classification map of the Center of Pavia image obtained by (a) SVM, (b) SRC, (c) SRC-JS, (d) SRC-LP, (e) ODL, 
(f) ODL-JS, (g) ODL-LP, (h) TDDL, (i) TDDL-JS and (j) TDDL-LP 


where m = 1 ,..M and n = 1,..., N\. Let g = 

and W = W A is 


vec 

the A 


($r) = vec f (wA — y) T W 

A columns of W. Expand Eq. (fl8| 


Eq. ( [32] ), we have 

DC 


. Expand Eq. © and combine it with 


0D n 


— Umn Vm 


, dC 

and —— = U — V, 
<9D 


(33) 


where U,V E M mxJVA and every element U mn , V mn are 
defined as 


U mn = g T (d t D ® Ip + Ar) 1 vec ((X - DA) t E ron ) , 
V mn = g T (d t D 0 Ip + Ar) _1 vec (a t E^d) , 


Consider the simplification for U first 

U mn = g T (d t D 0 Ip + Ar) _1 (ET n 0 Ip) vec ((X - DA) 
= g T F”i;ec ((X - DA) T ) _ , (34) 

where F = ^D T D 0 Ip + Ar) ; m(m) = {(m — 1)P + 

1,..., mP}, h(n) = {{n — 1)P+1,..., nP} denote the index 
sets; F n are the n columns of F. 

Let = vec ((X — DA) T j . It can be shown that 

V / fri 

is the m ih row of (X - DA). Now the (m, n) element f/ mn 
of the first part U can be written as 


where E mn e M Mx7Va is the indicator matrix that element 

(m, n) of E mn is 1 and all other elements are zero. Umn — (g T F) Cm 5 


( 35 ) 
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TABLE VII: Classification accuracy (%) for the Center of Pavia image 


Dictionary Size 

iV = 5536 



N 

= 45 



Class 

SVM 

SRC 

SRC-JS 

SRC-LP 

ODL 

ODL-JS 

ODL-LP 

TDDL 

TDDL-JS 

TDDL-LP 

1 

96.97 

99.58 

99.52 

99.28 

96.26 

99.13 

99.69 

98.54 

98.76 

99.13 

2 

91.09 

90.07 

96.89 

92.11 

84.25 

94.63 

90.63 

89.55 

97.59 

93.01 

3 

96.08 

95.42 

99.47 

98.62 

93.36 

96.23 

97.61 

95.18 

96.85 

98.71 

4 

86.32 

79.96 

78.28 

94.72 

61.61 

64.73 

97.30 

85.78 

85.18 

97.41 

5 

88.57 

93.70 

97.05 

97.14 

89.40 

84.62 

90.00 

88.08 

98.25 

99.59 

6 

95.27 

95.62 

98.19 

97.18 

94.35 

95.03 

94.49 

94.39 

99.36 

99.18 

7 

94.03 

93.86 

97.01 

96.84 

86.31 

86.90 

97.33 

91.65 

94.46 

98.66 

8 

99.83 

99.17 

99.66 

99.66 

96.76 

99.79 

99.00 

98.17 

99.38 

99.73 

9 

85.74 

98.58 

99.19 

99.95 

93.25 

90.56 

94.42 

95.53 

91.27 

95.61 

OA[%] 

95.68 

97.57 

98.01 

98.36 

93.67 

96.13 

97.86 

96.30 

98.01 

98.67 

AA[%] 

93.77 

94.00 

95.03 

97.28 

88.39 

90.18 

95.61 

92.99 

95.68 

97.89 

K 

0.923 

0.961 

0.965 

0.971 

0.899 

0.938 

0.965 

0.940 

0.968 

0.979 


Stacking all elements of U 


u = 


-(g T F)^M 


(e t f) N a ^ 

(g T F 


M- 


dC \ 1 ( dA 

dAj, ' VeC [dD n 


(40) 


A \ UJ -^mn J a 

Expand Eq. ([3TJ and combine it with Eq. ( [40] ), the desired 
gradient is 


= £ (g T F) iT ...(g T F)^ 


(36) 


where A n denotes the n th element of set A. 

Now consider simplification for V. Each element V mn of 
V can be written as 


where 


V mn = g T F • vec (A T E^ n D j 

= g T F ^D T E mn 0 I P ) vec (A T ) 
= g T F (D^ 0 lp) Al, 


9C _ TT T i 9C _ TT ^ 

07-1 Unin Vmn tind U V, 

vDmn 


U mn = g T F-W (E^ (X - DA)) a , 
v mn = g T F _1 vec (D T E mn A) A , 


(41) 


F = (ip 0 D t D + 7 L 0 Ijv 


-1 

*A,A ' 


(37) 


Let g has the same definition as that in Section VII The first 

df 


IS 


where A n is the n th row of A and D m is the m th row of D. 
Stacking every element of V, such that 


part u of 

U mn =g r Fvec{El n (X — DA)) a 
= (g T F) flWC (X-DA)- ( 


(42) 


V = 


g T F (l)J 0 I F ) 


g T F (£>], 0lp) 
r EiiD3- n (g T F) B AT 


LXLiPL (g T F) fl AT 
D[(g T F)^...(g T F)T 


g T F (pj 0 Ip) 
g T F (f»T 0lp) 


EtiD7„(g T F) ft A 


t n 

N 


E mn C R MxiV is the indicator matrix that the (m, n) element 
is 1 and all other elements are zero, m and n are defined as 
the following index sets, 

m(m, n) = {m,..., m + pM,... Vp s.t. n + pN G A 
n(n) = {n, n + A,..., n + (P — 1)N} D A 


TtLl^Mn (g T F) ft ATj 

(38) 


Eq. (42) can be further simplified by introducing h( n ) G M p , 
such that 


where p(p) = {p,p + P, ... ,p + (N — 1)P}. Combining Eq. 
and Eq. pS]) 


h (n) = f (g T F) n+pJV , if n + pN £ n(n),Vp 

1 0 , otherwise 


|g = u - V = - D/3 a A t and = tf3 T - D/3A T 

(39) 

where f3 Ac = 0 and (3 A = [(g T F)^ , • • • , (g T F)^] T . More 

generally, we have defined /3 A G R NaxP such that vec(fi\) = 

Fg. 

VIII. Appendix B 

The gradient for updating the dictionary can be written as 

dA 


Now Eq. ( [42] ) can be rewritten as, 

TT — 

u mn — 11 s m) 


(43) 


(44) 


where £ ' is the m th row of X — DA. The first part U of the 

df_ 

6>D 

r ^ h d) 


gradient Mr can be obtained by stacking all U mn in Eq. d44 ) 


U = 


dC fdC 

dD mn - vec Ua 


0D„ 


= € 


= £/3 r , 


,&h (1) ••• 

h«-hW 


x^hW 


(45) 
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where we define (3 = , h^] 7 " G R NxP . By 

examining the nonzero elements position of h^ 1 ),..., h^, 
it is not difficult to find the relation between (3 and g T F 


vec (f3) A = Fg and vec {/3) Ac = 0. (46) 

Now consider the second term V mn of 

^ mn 

Vmn = (g T F) (ip <8> D T E mn ) A A vec(A) A 

= (g T F) ( [D m vec(A) n ... D mvec( A)( n+ (p_ 1 ) j /v)] ) A 

p 

= D m y^A n , p /3 p , (47) 

P= 1 


where (3 P is the p th column of (3. The differentiation can 
be derived from V mn in Eq. © 


V = 


Di 


Ep=i A llP /3 p , • • • , Ep=i A.n, p F 


Dm [Ep=i Ai )P (Fg) p • • • E; =1 A n>p (Fg)^ 

" P P 

_p=i p =i 

= D/3A t , 


= D 


(48) 


Combining Eq. <|45ji and Eq. <(48ji, we reach the gradient of 
the dictionary 


dC_ 

<9D 


= £/3 t — D/3A t . 


(49) 
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